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Abstract 

We present a Bayesian Neural Network algorithm implemented in the TMVA 
package [lj, within the ROOT framework 0]. Comparing to the conventional 
utilization of Neural Network as discriminator, this new implementation has 
more advantages as a non-parametric regression tool, particularly for fitting 
probabilities. It provides functionalities including cost function selection, com- 
plexity control and uncertainty estimation. An example of such application in 
High Energy Physics is shown. The algorithm is available with ROOT release 
later than 5.29. 
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Solution method: An implementation of Neural Network following the Bayesian sta- 
tistical interpretation. Uses Laplace approximation for the Bayesian marginalizations. 
Provides the functionalities of automatic complexity control and uncertainty estima- 
tion. 

Running time: Time consumption for the training depends substantially on the size 
of input sample, the NN topology, the number of training iterations, etc. For the 
example in this manuscript, about 7 minutes was used on a PC/Linux with 2.0GHz 
processors. 



1. Introduction 

Neural Network (NN) has been considerably utilized in High Energy Physics 
in the past decade. In most applications, The NN was used as a discriminator to 
separate signal from backgrounds. It is a powerful tool to extract the features of 
target categories in the multivariate phase space, and project them into a scalar 
discriminator. However, such applications are often criticized for the reliance on 
the simulation or test-beam data as training samples, which may have distinct 
features comparing to the real data. On the other hand, the usage of NN as 
a non-parametric regression tool is much less exploited, and usually does not 
suffer from such concerns. Given sufficient complexity, even a single-hidden- 
layer NN can be seen as a universal approximator of any nonlinear multivariate 
function [3|,|4|. Composed of simple nonlinear functions called "neurons" , such as 
hyperbolic tangent functions, an NN can achieve great complexity by connecting 
many neurons with variable weights w, which serves as the free parameters of 
the model. Equation (p} shows the analytical form of a typical NN, with its 
structure shown in figure [TJ 
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NN can approximate not only functions whose output values span real num- 
ber space, but also those with confined output values. One such category par- 
ticularly interesting is a probability. The output value can be confined within 
and 1 by applying a sigmoid transformation to the output neuron, i.e. replacing 
/( 3 ) in equation {T]) by 
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Figure 1: The structure of an NN with one hidden layer [ij. 



There are several advantages of NN comparing to the conventional represen- 
tation of probabilities by histograms. First, NN can approximate the distribu- 
tions in an unbinned manner, without the arbitrariness in the choice of binning 
and the subsequent loss of information. Second, NN is more practical for mul- 
tivariate approximation without suffering the curse of dimensionality. Another 
great advantage of NN as a regression tool is that the correlations between the 
input variables can be well approximated. 

Compared to the application as a "black-box" discriminator, NN applica- 
tion as a regression tool must follow a more explicit statistical interpretation. 
This is particularly important if the subsequent application requires detailed 
statistical information, such as limit setting for new physics searches. Here we 
present an NN algorithm following the Bayesian inference theory, implemented 
in the TMVA package [1]. This manuscript is organized as follows: section [5] 
is a brief review of the statistical interpretation for the training (fitting) and 
prediction procedure of the NN. Then the implementation of our Bayesian NN 
(BNN) algorithm is described in detail in section [3] Finally in section HJ an 
application of this BNN in High Energy Physics is demonstrated, where it is 
used to approximate the false identification rate of isolated muons. 

2. Statistical Interpretation of Training and Prediction 

As a generic model, NN normally has a huge number of degrees of freedom 
and incomprehensible complexity. On the other hand, the procedure to train 
an NN and makes predictions with it can be clearly interpreted by probability 
theory as Bayesian inference. 

Given an observed sample D — {xi , ij}, with Xi as the multiple input vari- 
ables of entry i, and U as its observed output value, the training (fitting) of 
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NN can be viewed as a process to determine the probability of free parameters 
w, based on D. According to the Bayes theorem, this "posterior" probabil- 
ity comes from the combination of previous knowledge of w ( "prior" ) and the 
compatibility of sample D with the NN ("likelihood") 

P(w|D) ocP(D|w)F(w) (3) 

The likelihood P(D|w) in Bayesian language is closely related the "cost 
function" in machine learning language. The optimization of NN by minimizing 
the cost function is mostly equivalent to maximizing the likelihood. For example, 
the most commonly used sum of the square error (SSE) function can actually 
be translated as the negative logarithm of the Gaussian likelihood, as shown by 
equation 

SSE = ^( 2/ (x i ;w)-t i ) 2 

i 

= - log(n(exp(-(j/(x i; w) - Uf) (4) 

i 

cx -log(P(D|w)) 

The prior probability P(w) is much less emphasized in classic usage of NN, 
which in fact often assumes a flat distribution. We will see later that a Bayesian 
regulator term can be added to the cost function, based on a simple prior knowl- 
edge. It is also worth mentioning that, although the probability distribution of 
w can be obtained, only the most probable value is kept in classic usage. 

Given a new input vector x', the prediction with NN can also be performed 
as a Bayesian inference. Although classic usage of NN only gives one single 
value of y' with the most probable w, we should be able to predict a proba- 
bility distribution of the output if we marginalize over the distribution of NN 
parameters P(w|D) by equation ([S]). 

P(y'\D,x') = J P(y'\x , w)P(w|£>) dw (5) 

3. Bayesian Implementations 

3.1. Cost Function 

As shown in equation (Q}, the commonly used cost function, sum of the 
square error, can be interpreted as the negative logarithm of Gaussian likelihood 
function. This cost function is applicable for most regression applications, where 
a Gaussian distribution can be assumed for the observed values around their 
true values. 

However, when the NN is used to approximate a probabilistic distribution, 
such an assumption of Gaussian likelihood may not be appropriate. As a prob- 
abilistic classifier, the NN's output y is expected to approximate the probability 
of membership to one category, constrained between and 1 by the sigmoid 
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function (equation fl2J)). And the observed value t for each entry in the input 
sample D is usually either or 1, representing the fact whether the entry meets 
the condition, or belongs to the desired category. The distribution of observation 
t around probability y should then follow the Bernoulli distribution 



Correspondingly, the cost function for classification should take the form of 
the so-called "cross-entropy" function (CE), the sum of negative logarithm of 
Bernoulli distribution, 



With the sigmoid transformation and CE cost function, the NN can approximate 
the probabilities with rigorous statistics interpretation. This can be chosen by 
the option of the MLP method Q EstimatorType=CE. 

3.2. Complexity Control 

NN gains the capability of universal approximation by a huge number of 
degrees of freedom. A typical single-hiddcn-layer NN, even only for a few in- 
put variables, could have O(10 2 ) free parameters w. A model with such great 
complexity may suffer from over-fitting. That is to say, it will approximate not 
only the desired connection between the input variables and the output value, 
but also the undesired fluctuations of the input sample. This is particularly an 
issue if the input sample has limited statistics. The predictivity of the model 
will be greatly deteriorated by exaggerated fluctuations. 

The commonly employed solution against over-fitting is the so-called "cross 
validation" technique. A fraction of the input sample, normally half of the 
statistics, is taken away from the training process and used as a test set. Over- 
fitting is identified during the process of training, if the cost function of the test 
set starts to increase. One problem with this technique is the possibility that 
the cost function of test set may have merely hit a local minimum. Another 
problem is the reduction of training sample size. This is a non-trivial drawback 
for cases in which over-fitting may occur, which normally have trouble with 
statistics already. 

In our implementation, another solution with regulators is adopted to avoid 
over-fitting. Although it is necessary to keep a large number of free parameters 
in order to make the model generic, the value of the parameters can be con- 
strained to reduce unnecessary complexity. This can be expressed as a Bayesian 
prior knowledge about the model, assuming the value of the free parameters w 
should be limited to the vicinity of zero. A Gaussian distribution centered at 
zero is used to represent such prior. Correspondingly, a "regulator" function 
can be obtained as the negative logarithm of this Gaussian prior for all NN 
parameters u>i. 



P(D\w)=y t (l-y) 



i—t 



(6) 



CE = lo S2/( x i; w ) - (! - U) log(l - y(x s ; w))) 



(7) 



Reg = -log(P(w)) =^(a lW , 2 ) 



(8) 
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Adding this regulator term into the cost function actually gives the negative 
logarithm of posterior probability P(w\D). In the BNN, this summed value is 
minimized instead to obtain the optimal w. 

In equation ©, the hyper-parameters an determine the range of w, con- 
sequently reflecting the knowledge of required complexity of the model. The 
values of oti are not necessarily the same for each Wi. From a topological point 
of view, all the neurons within one hidden layer are computationally exchange- 
able. So their outgoing weights share the same hyper-parameter. This is not 
applicable to the input variables because they normally have different impor- 
tance. Therefore, each group of weights originated from the same input neuron 
has its own a^. For the same reason, the bias neurons in each layer have inde- 
pendent hyper-parameters. 

Although we can categorize the weights and associate them to different 
hyper-parameters, in most cases we do not actually possess a priori knowledge 
about the complexity needed, namely the values of on . We implemented an itera- 
tive approach proposed by MacKay in 1992 [Bj], in which these hyper-parameters 
are estimated during the training of NN, by optimizing the "evidence" of the 
models: 



where the optimal w is determined by minimizing the cost function and the 
integral is evaluated approximately for a given a. After estimating the optimal 
a, a new optimal w is recalculated and the process is iterated, until desired 
convergence is reached. 

This functionality can be activated by the MLP option UseRegulator=true, 
together with the BFGS training method. 

3.3. Bayesian Prediction 

In the classic usage of NN, the prediction upon new input x' is the most 
probable value obtained with the most probable w. Instead, in Bayesian data 
analysis, it is more essential to marginalize over all possible values of w, and 
obtain the prediction as a probability distribution, as shown by equation ([5]). 
Such a prediction contains not only the most probable value, but also the un- 
certainty of the inference. The estimation of uncertainty is crucial, especially 
for extrapolated predictions in multivariate phase space. 

Unfortunately, as a common difficulty for most Bayesian applications, the in- 
tegration of equation ([5]) is generally non-trivial. Although the posterior of NN 
parameters P(w\D) is calculable for any w, the distribution for such "nuisance 
parameters" does not have a closed-form expression. In our implementation, 
an analytical distribution is used for the integration, which is the Laplace ap- 
proximation of the posterior around the optimal value w MP [(J, as shown in 
equation (TTU)). 





P(w\D) ~P(w MP |£>)exp 



( 



-Aw T AAw 

2 



) 



(10) 
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A represents the Hessian matrix of the cost function, namely the negative log- 
arithm of the posterior 

A = -VVlogP(w| J D)| w MP. (11) 

For feed-forward NN either with a linear output neuron and SSE cost func- 
tion, or with a sigmoid output neuron and CE cost function, its Hessian matrix 
A can be approximated and written consistently as 

A^/'g T g 

/' = dy/dh and g = V/i. h and y are the values before and after output neuron 
transformation. 

Furthermore, a linear dependence of h over weights w can be approximated 
as well. 

/i(x';w)~/i(x';w MP )+g-Aw (13) 
Therefore, the distribution of h can be calculated analytically as 

P(h\D,x')= J /i(x';w)F(w|L>)dw ^ 

=,7V(/ l (x';w MP ),g T A- 1 g) 

It is a Gaussian distribution with the mean value of the classic NN predic- 
tion. In addition, each prediction can give an associated uncertainty, which 
originates from the uncertainty of the NN parameters determination. For prob- 
ability fitting, the probability density function (p.d.f.) of the output y is a 
sigmoid-transformed Gaussian distribution of h (equation [2]). As a non-linear 
transformation, the sigmoid function will convert the symmetric error bar of h 
into an asymmetric one for y. 

Besides the Laplace approximation used in our implementation, the integra- 
tion can also be solved numerically by Markov Chain Monte Carlo (MCMC) 0- 
Compared to MCMC approach, the analytical approximation is easier for both 
training and prediction. The estimation of the Hessian matrix during the train- 
ing process can be activated by MLP option CalculateErrors=true. By con- 
figuring the Reader option Error=true, the Hessian matrix will be loaded for 
prediction. And the asymmetric uncertainties can be evaluated by Reader func- 
tions GetMVAErrorUpper () and GetMVAErrorLower () . 

4. Application in High Energy Physics 

To demonstrate the practical usage of the BNN, we will show an example 
in High Energy Physics: measurement of the false identification rate of isolated 
muons. The test data, job control scripts and instructions can be obtained from 
the CPC library. 
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Muons are an important electroweak signature in collider physics. According 
to their sources, they can be categorized by the so-called isolation condition, i.e. 
adjacent particle flow. Those "non-isolated" muons, which have accompanying 
particles flying in a similar direction, mostly come from semi-leptonic decay of 
heavy flavor quarks (b, c). The other "isolated" muons are more likely from 
electro-weak processes such as W boson production, and therefore taken as 
signals in many physics topics. To get rid of the former category from a mixed 
sample, an isolation cut is often imposed on the sum of transverse momenta, 
for all visible particles inside a cone around the muon. Here we choose a typical 
configuration, limiting transverse particle flow within AR — J (A7/) 2 + (A0) 2 < 
0.2 to be less than 15% of the muon's transverse momentum. 

Due to event-specific kinematics and detector response, there is always a 
considerable fraction of muons from "non-isolated sources" being falsely iden- 
tified as isolated muons. Such muons will contaminate the signal sample when 
a final state with isolated muons is expected. It is non-trivial to estimate this 
background accurately, for both new physics searches and precise measurements. 
Rather than relying on simulation, it is highly desirable to measure the false 
identification rate / = P(isol) directly from collision data. 

In the following, we demonstrate how BNN can be used for such false rate 
measurement, using Monte-Carlo simulation of heavy-flavor quark production 
(bb, cc) in the proton-proton collisions. The samples are produced by the Pythia 
generator Q, with 7 TeV center-of-mass energy as at LHC [§]. Muon pseudo- 
rapidity acceptance is assumed to be within 2.5, with the threshold of transverse 
momentum as above 10 GeV. The fiducial efficiency of the detector is assumed 
to be 100% for simplicity. 

Two samples are generated for comparison. The first sample requires sin- 
gle muon within the detector acceptance, while the second sample requires two 
such muons with the same charge. The former final state is dominated by the 
non-isolated sources, and therefore ideal for the measurement of fake rate. The 
latter final state instead is a typical channel in which new physics may man- 
ifest, and requires accurate estimation of the background contribution. With 
collision data, we need to measure the fake rate from the single muon events, 
then apply it to those same-charge di-muon events to estimate the background 
contribution 10]. With the MC samples mentioned above, we can study how 
to make the fake rate measurement compatible between these two channels. 

In a first attempt to measure the average false identification rates, we ob- 
tain a quite different values, 33.3(0. 1)±% from the single muon sample and 
19.8(0.2)% from the same-charge di-muon sample. This incompatibility is due 
to the fact that the probability of false identification, P(isol|x), is not constant 
over kinematic variables x. Two examples of such kinematic variables are 

• pt: the transverse momentum of the muon itself. 

• H'y: the scalar sum of transverse momenta for all visible particles in the 
event, and the measurable energy imbalance. 

The kinematic distributions in these two samples are quite distinct, as can be 
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Figure 2: Probability density distributions over px (left) and Ht (right), for muons in the 
single muon control sample (solid) and di-muon signal sample (circle). 



seen in figure [2J As a result, the observed average rates, marginalized by equa- 
tion (|15[) , turned out to be incompatible. In order to make a correct prediction 
in the signal region, it is important to measure the probabilities P(isol | x) rather 
than the marginalized rate. 



Our BNN implementation is used to perform this measurement of P(isol | x). 
About 50,000 muons in the single muon QCD events is used as input sample D. 
Their corresponding kinematic variables pt and Pt are declared as the input Xj 
of the NN. And a target value ti as 1 or is assigned, depending on whether the 
muon passed the isolation criteria. The NN is constructed with one hidden layer 
of ten neurons, and a sigmoid-transformed output neuron. As a probability fit, 
the training is configured to use the cross-entropy cost function of equation ([7]). 
In addition, the regulator mechanism is activated to prevent over-fitting. 

The fitted distribution P(isol | pr, Pt) can be visualized in figure [3] The cor- 
relation between the two variables is clearly fitted, with a smooth extrapolation 
into the peripheral phase space region. 

We then test this 2D false rate function with muons in the same-charge di- 
muon sample. Using the trained BNN, we can predict the probability of passing 
isolation criteria P(isol | pp, Pt) for each muon. Marginalizing the predicted 
probabilities over all the muons in this sample (30764 muons), we can predict 
that 6077 of them will pass the isolation cut, corresponding to an average false 
rate of 19.8%. This is very close to the actual number of passed muon, 6093, 
equivalent to an average false rate of 19.8(0.2)%. 

As described is section l3~3l the BNN can also calculate the uncertainty asso- 
ciated to each prediction, based on the uncertainty on the determination of the 
free parameters w. It reflects the statistical property of the training sample. To 
test this estimation, another 100 NNs are trained with the same network topol- 
ogy. But the input samples, as well as the random seeds for initialization, are 
different in each training. For every muon in the same-charge di-muon sample, 
we use the BNN to predict its probability of pass Pbnn , as well as the associated 
asymmetric error bars, denoted as Cg NN and Cg NN . As a comparison, we also 
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Figure 3: 2-dimensional false isolation probabilities, fitted by the BNN. 



use the "batch" NNs to make 100 predictions, and calculate their mean value 
-PBatch and standard deviation CBatch- Comparing the ratio between ctbnn and 
CBatch (figure 4(a) ) for all the muons, we can see that the BNN uncertainty is 
generally consistent with the standard deviation of the batch predictions. 




"sNl/'Wh l0 9 10 ( O BNN /P BNN) 

(a) (b) 

Figure 4: (a) The distribution of the ratios between <tbnn and "Batch, for all muons in the 
same-charge di-muon events. Positive value stands for o"bnn /°Batch an <3 negative value stands 
for o-bnn /"Batch- (b) The correlation between log 10 (<r B NN/-FWj) and log 10 ((r Batch /P Batch ). 
"BNN is the average of o-gNN and CTg NN - 



To further understand the precision of the BNN uncertainty estimation, 
the correlation between logi q(^bnn/^bnn) and log 10 (cr B atch/-pBatch) for all the 
muons are plotted in figure [4(b)] Good consistency can be observed when the 
relative uncertainties are less than ^30%, which is the case for a large fraction 
of the entries. For predictions with large relative uncertainty, BNN tends to 
over-estimate the uncertainty, as the approximation applied in the Bayesian 
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marginalization becomes less accurate. 

It is worthwhile to notice that the uncertainty estimated by BNN only ac- 
counts for the confidence in the determination of BNN parameters, evaluated 
based on the training sample used. The difference between the prediction and 
observation also involves the statistical fluctuations of the prediction sample, as 
well as the systematic uncertainties in the application, such as sample selections, 
the choice of parameterized variables and their measurement uncertainties. 

Furthermore, only two kinematic variables were considered in this truth-level 
study, and consistent false rate estimation has already been observed. In reality, 
there could be additional factors which considerably affect the false rate, due to 
detector effects and collision configuration. Fortunately, the measurement can 
be easily extended to a higher dimensional phase space with BNN as the fitting 
tool. 

5. Conclusion 

In this manuscript we presented a BNN algorithm which can be used as an 
unbinned fitting tool, which is particularly interesting for fitting probabilities. 
The Bayesian implementation also provides functionalities such as controlling 
unnecessary complexity, and uncertainty estimation. 

The demonstration with a HEP use case clearly showed the capability of 
BNN as an unbinned regression tool, especially if several input variables with 
correlation are involved. This technique has already been used to analyze the 
data collected by LHC in 2010 [10|. It has a very promising future for further 
applications, particularly for higher dimensional problems. 
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